Improve SQLite write speed; fix indefinite hang while analyzing large bundles (>1 GB) #66
Conversation
- Set DB connection journal mode to use Write-Ahead Log to speed up insert transactions.
- Fix buffer offset bugs in CRC checksum calculation; use a fast CRC fingerprint for external file references.
Pull request overview
Addresses a hang/performance cliff in UnityDataTools analyze on very large (>1 GB) Asset Bundles by reducing expensive CRC re-processing and improving SQLite write throughput.
Changes:
- Fix `UnityFileReader.ComputeCRC` chunking/offset logic to avoid re-processing and incorrect sizing.
- Speed up CRC handling of `StreamingInfo`/`StreamedResource` by using a lightweight external-stream fingerprint instead of reading large `.resS` blobs.
- Attempt to switch SQLite to WAL journal mode to speed up insert-heavy workloads.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| UnityFileSystem/UnityFileReader.cs | Corrects CRC chunk iteration to advance offsets and handle final partial chunks properly. |
| Analyzer/PPtrAndCrcProcessor.cs | Replaces external resource reads with an offset/size/path fingerprint and fixes CRC sizing for managed-reference strings. |
| Analyzer/SQLite/Writers/SerializedFileSQLiteWriter.cs | Updates PPtrAndCrcProcessor construction to match new API; leaves an unused parameter. |
| Analyzer/SQLite/Writers/SQLiteWriter.cs | Adds a WAL pragma on connection open (but currently conflicts with Init.sql pragmas). |
```csharp
using var walCommand = m_Database.CreateCommand();
walCommand.CommandText = "PRAGMA journal_mode=WAL";
walCommand.ExecuteNonQuery();
```
PRAGMA journal_mode=WAL set here is later overridden by Resources.Init (Init.sql currently ends with PRAGMA journal_mode = MEMORY;), so the connection will not actually run in WAL mode. To make WAL effective, either move the WAL pragma to after executing Resources.Init, or remove/adjust the journal_mode pragma in Init.sql to not override the desired mode.
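One way to address this (a sketch only, assuming `m_Database` exposes the standard ADO.NET `CreateCommand` API and that `Resources.Init` is the string holding the Init.sql script) is to execute Init.sql first and set the journal mode afterwards, so nothing can override it:

```csharp
// Run the schema/pragma script first...
using (var initCommand = m_Database.CreateCommand())
{
    initCommand.CommandText = Resources.Init; // assumed accessor for Init.sql
    initCommand.ExecuteNonQuery();
}

// ...then switch to WAL last, so Init.sql's trailing
// `PRAGMA journal_mode = MEMORY;` cannot override it.
using (var walCommand = m_Database.CreateCommand())
{
    walCommand.CommandText = "PRAGMA journal_mode=WAL";
    walCommand.ExecuteNonQuery();
}
```

Alternatively, removing the `journal_mode` pragma from Init.sql achieves the same effect with less ordering sensitivity.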
It seems this is a good optimization to add for UnityDataTools, but according to the Copilot analysis it is not actually applied.
```diff
 public void WriteSerializedFile(string relativePath, string fullPath, string containingFolder)
 {
     using var sf = UnityFileSystem.OpenSerializedFile(fullPath);
     using var reader = new UnityFileReader(fullPath, 64 * 1024 * 1024);
-    using var pptrReader = new PPtrAndCrcProcessor(sf, reader, containingFolder, AddReference);
+    using var pptrReader = new PPtrAndCrcProcessor(sf, reader, AddReference);
```
containingFolder is now unused after removing it from the PPtrAndCrcProcessor constructor call, which will trigger an unused-parameter warning and can mislead callers. Consider removing this parameter from WriteSerializedFile(...) (and updating its call sites) if it’s no longer needed.
```diff
-private UnityFileReader GetResourceReader(string filename)
+private static uint AppendExternalStreamFingerprint(uint crc32, long offset, int size, string filename)
 {
     var slashPos = filename.LastIndexOf('/');
     if (slashPos > 0)
     {
         filename = filename.Remove(0, slashPos + 1);
     }

-    if (!m_resourceReaders.TryGetValue(filename, out var reader))
-    {
-        try
-        {
-            reader = new UnityFileReader("archive:/" + filename, 4 * 1024 * 1024);
-        }
-        catch (Exception)
-        {
-            try
-            {
-                reader = new UnityFileReader(Path.Join(m_Folder, filename), 4 * 1024 * 1024);
-            }
-            catch (Exception)
-            {
-                Console.Error.WriteLine();
-                Console.Error.WriteLine($"Error opening resource file (unknown)");
-                reader = null;
-            }
-        }
-
-        m_resourceReaders[filename] = reader;
-    }
-
-    return reader;
+    crc32 = Crc32Algorithm.Append(crc32, BitConverter.GetBytes(offset));
+    crc32 = Crc32Algorithm.Append(crc32, BitConverter.GetBytes(size));
+    crc32 = Crc32Algorithm.Append(crc32, Encoding.UTF8.GetBytes(filename));
```
AppendExternalStreamFingerprint uses BitConverter.GetBytes(...), which is endianness-dependent. That makes the computed CRC vary on big-endian platforms even for identical data. Prefer writing offset/size in a fixed byte order (e.g., little-endian via BinaryPrimitives) before appending to the CRC to keep results stable across architectures.
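A possible fix, sketched here under the assumption that the project's `Crc32Algorithm` (Force.Crc32-style `Append` overloads) is in scope, is to serialize the integers in an explicit little-endian layout via `BinaryPrimitives`:

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

private static uint AppendExternalStreamFingerprint(uint crc32, long offset, int size, string filename)
{
    // Fixed little-endian layout: 8 bytes of offset followed by 4 bytes of size,
    // so the CRC input bytes are identical on little- and big-endian hosts.
    var buffer = new byte[12];
    BinaryPrimitives.WriteInt64LittleEndian(buffer.AsSpan(0, 8), offset);
    BinaryPrimitives.WriteInt32LittleEndian(buffer.AsSpan(8, 4), size);
    crc32 = Crc32Algorithm.Append(crc32, buffer);
    return Crc32Algorithm.Append(crc32, Encoding.UTF8.GetBytes(filename));
}
```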
```diff
 /// <summary>
 /// Computes CRC32 over a contiguous byte range, reading the file in buffer-sized chunks.
 /// </summary>
 public uint ComputeCRC(long fileOffset, int size, uint crc32 = 0)
 {
-    var readSize = size > m_Buffer.Length ? m_Buffer.Length : size;
-    var readBytes = 0;
-
-    while (readBytes < size)
+    var remaining = size;
+    while (remaining > 0)
     {
-        var offset = GetBufferOffset(fileOffset, readSize);
-        crc32 = Crc32Algorithm.Append(crc32, m_Buffer, offset, readSize);
-        readBytes += readSize;
+        var chunk = (int)Math.Min((long)m_Buffer.Length, remaining);
+        var offset = GetBufferOffset(fileOffset, chunk);
+        crc32 = Crc32Algorithm.Append(crc32, m_Buffer, offset, chunk);
+        fileOffset += chunk;
+        remaining -= chunk;
     }
```
ComputeCRC chunking logic was changed to fix offset/size handling; this is easy to regress without coverage. Add a unit test that verifies CRC results for ranges that cross the internal buffer boundary (e.g., size > buffer, and a final partial chunk) to lock in the corrected behavior.
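Such a test could look roughly like this (an NUnit-style sketch; it assumes `UnityFileReader` can open a plain file from disk and that a Force.Crc32-style `Crc32Algorithm.Compute` helper is available as the reference implementation):

```csharp
[Test]
public void ComputeCRC_RangeLargerThanBuffer_MatchesSingleShotCrc()
{
    // 10 000 bytes of deterministic pseudo-random data.
    var data = new byte[10_000];
    new Random(1234).NextBytes(data);
    var path = Path.GetTempFileName();
    File.WriteAllBytes(path, data);

    try
    {
        // 4 KB buffer < 10 000 bytes: forces two full chunks plus a final partial chunk.
        using var reader = new UnityFileReader(path, 4 * 1024);
        var chunked = reader.ComputeCRC(0, data.Length);

        Assert.AreEqual(Crc32Algorithm.Compute(data), chunked);
    }
    finally
    {
        File.Delete(path);
    }
}
```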
SkowronskiAndrew left a comment
Thanks for submitting this proposed fix.
I think the nature of the fix needs to change a bit so that we don't lose the value of the CRC checks into those external streams.
But this is a good start - it sounds like we definitely need to do something to fix `analyze` for the data set you are looking at. There is no reason we shouldn't be able to calculate the CRC across large data in a decent amount of time if we are efficient with RAM usage and access patterns. I wasn't familiar at all with this area of UnityDataTools, but I hope my suggestions point in a useful direction.
```diff
-private UnityFileReader GetResourceReader(string filename)
+/// <summary>
+/// Extends CRC32 with a stable fingerprint for an external stream segment without reading blob bytes.
```
The CRC is stored in the database to represent the content of the object, including all external resource references like mesh and texture content. The CRC is the fingerprint for "has this asset changed?"

- If a texture blob in a .resS file changes but stays the same size, the old code detects it (different bytes = different CRC).
- The new code gives an identical "fingerprint" (same offset + size + filename).

It sounds like there are performance problems in the existing calculation, but I don't think this is the solution. There are probably two things to do:

- Continue to calculate the CRC of `.resS` and `.resource` blobs, but make sure we do this in an efficient fashion. I'll take a bit more of a look at the existing code.
- Offer a way to skip the CRC calculation if you don't need it. Currently you can turn it off with `--skip-references`, but that is not intuitive, so there should be a separate `--skip-crc` and some careful work to make sure that the permutations of those flags work properly.
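The first option could be sketched as a streaming CRC that reuses one scratch buffer, so memory stays flat regardless of blob size (this is only an illustration of the shape; `ReadArray` is a hypothetical raw-read helper, not necessarily the existing `UnityFileReader` API):

```csharp
private static uint AppendExternalStreamCrc(uint crc32, UnityFileReader reader,
                                            long offset, long size, byte[] scratch)
{
    // Hash the real .resS/.resource bytes chunk by chunk instead of loading
    // the whole blob, preserving the change-detection guarantee of the CRC.
    while (size > 0)
    {
        var chunk = (int)Math.Min(scratch.Length, size);
        reader.ReadArray(offset, chunk, scratch); // hypothetical helper
        crc32 = Crc32Algorithm.Append(crc32, scratch, 0, chunk);
        offset += chunk;
        size -= chunk;
    }
    return crc32;
}
```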
```csharp
private CallbackDelegate m_Callback;

private Dictionary<string, UnityFileReader> m_resourceReaders = new();
```
I'm wondering if this existing implementation was very expensive in terms of allocating a lot of memory during processing and never freeing it. Is it possible you were running out of memory?
Apparently each UnityFileReader allocates 4 MB, and I guess it might grow based on the actual data size?
I'm guessing that the way to fix things will be to improve the management of these UnityFileReaders. I think these should be flushed between SerializedFiles - currently .resS and .resource files are NEVER shared; it's always pointing at another companion file with the same filename root.
So the only purpose of caching would be while processing different objects of the same serialized file (because they might reference different segments of the same file).
I recommend that you take a look at the access pattern, e.g. using some logging to double-check exactly what requests are coming in for the external streams in the case that was so slow.
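The flush-between-SerializedFiles idea could be as simple as disposing the cached readers when the processor finishes one serialized file (a sketch of the suggested lifetime, not existing code):

```csharp
// Called when processing of one serialized file ends. Since .resS/.resource
// companions are never shared across serialized files, nothing useful is lost.
private void FlushResourceReaders()
{
    foreach (var reader in m_resourceReaders.Values)
    {
        reader?.Dispose();
    }
    m_resourceReaders.Clear();
}
```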
Using the `UnityDataTools analyze` command on a target folder containing large (>1 GB) Asset Bundles can hang indefinitely. This is due to two issues, which this PR addresses with the changes described above.
With these changes, processing large asset bundles goes from an indefinite (>24 hours) process to a 15 minute operation.